This notebook explores the more recent data from NYC Open Data Data Set.
This dataset can also be reached and interacted with through its Google BigQuery location
Try a different version: handout page, Tab Navigation
Reduce Car Accidents in Brooklyn
For this exercise, we’d like you to analyze data on New York motor vehicle collisions and answer the following question:
What are your ideas for reducing accidents in Brooklyn?
Imagine you are preparing this presentation for the city council who will use it to inform new legislation and/or projects.
Briefly:
Libraries that will be used during exploration
library(magrittr)
library(dplyr)
library(ggplot2)
library(viridis)
library(plotly)
library(maps)
library(rgeos)
library(rgdal)
library(ggthemes)
library(crosstalk)
library(leaflet)
library(d3scatter)
library(d3heatmap)
library(rnoaa)
#library(ggmap)
Collect API tokens in Environment Variables (purposefully kept hidden here). Tokens and keys used include Google Maps API key (get one here, Mapbox Access Token(get one here) and an NCDC token (here) for NOAA weather data.
Load data from /data directory and into memory
dt <- read.csv(file = "data/NYPD_Motor_Vehicle_Collisions.csv")
Inspect structure of dataset with the str() command:
str(dt)
'data.frame': 990800 obs. of 29 variables:
$ DATE : Factor w/ 1709 levels "01/01/2013","01/01/2014",..: 200 200 200 200 1068 920 611 200 65 65 ...
$ TIME : Factor w/ 1440 levels "0:00","0:01",..: 556 631 632 644 661 936 46 686 1051 1066 ...
$ BOROUGH : Factor w/ 6 levels "","BRONX","BROOKLYN",..: 1 2 2 3 1 1 1 1 3 3 ...
$ ZIP.CODE : int NA 10454 10466 11218 NA NA NA NA 11218 11236 ...
$ LATITUDE : num 40.7 40.8 40.9 40.6 40.7 ...
$ LONGITUDE : num -73.9 -73.9 -73.9 -74 -73.9 ...
$ LOCATION : Factor w/ 90272 levels "","(0.0, 0.0)",..: 28545 73165 89328 17808 52600 1 1 14824 17763 18379 ...
$ ON.STREET.NAME : Factor w/ 9151 levels "","?EST 125 STREET",..: 1420 1 3493 1 1 1 6598 4069 738 7071 ...
$ CROSS.STREET.NAME : Factor w/ 9585 levels "","0","01247",..: 1 1 9364 1 1 1 7399 3638 114 4201 ...
$ OFF.STREET.NAME : Factor w/ 59908 levels "","(26 BROOKLYN TERMINAL MARKET LOT)",..: 1 38225 1 29898 1 1 1 1 1 1 ...
$ NUMBER.OF.PERSONS.INJURED : int 0 0 1 0 0 0 0 0 1 2 ...
$ NUMBER.OF.PERSONS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.PEDESTRIANS.INJURED: int 0 0 1 0 0 0 0 0 0 0 ...
$ NUMBER.OF.PEDESTRIANS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.CYCLIST.INJURED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.CYCLIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.MOTORIST.INJURED : int 0 0 0 0 0 0 0 0 1 2 ...
$ NUMBER.OF.MOTORIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ CONTRIBUTING.FACTOR.VEHICLE.1: Factor w/ 49 levels "","Accelerator Defective",..: 10 47 47 47 47 11 1 43 43 43 ...
$ CONTRIBUTING.FACTOR.VEHICLE.2: Factor w/ 49 levels "","Accelerator Defective",..: 47 1 1 47 47 47 1 47 47 47 ...
$ CONTRIBUTING.FACTOR.VEHICLE.3: Factor w/ 43 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 42 1 ...
$ CONTRIBUTING.FACTOR.VEHICLE.4: Factor w/ 42 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CONTRIBUTING.FACTOR.VEHICLE.5: Factor w/ 31 levels "","Aggressive Driving/Road Rage",..: 1 1 1 1 1 1 1 1 1 1 ...
$ UNIQUE.KEY : int 3612721 3612791 3618743 3614471 3284922 2833714 336679 3618925 3598095 3597360 ...
$ VEHICLE.TYPE.CODE.1 : Factor w/ 18 levels "","AMBULANCE",..: 15 10 12 15 10 10 1 10 10 15 ...
$ VEHICLE.TYPE.CODE.2 : Factor w/ 18 levels "","AMBULANCE",..: 10 1 1 10 16 10 1 10 10 15 ...
$ VEHICLE.TYPE.CODE.3 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 15 1 ...
$ VEHICLE.TYPE.CODE.4 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
$ VEHICLE.TYPE.CODE.5 : Factor w/ 16 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
Inspect summary of dataset with summary() command:
summary(dt)
DATE TIME BOROUGH ZIP.CODE
01/21/2014: 1161 16:00 : 12792 :260725 Min. :10000
01/18/2015: 960 15:00 : 12748 BRONX : 95396 1st Qu.:10075
02/03/2014: 960 17:00 : 12597 BROOKLYN :223552 Median :11205
03/06/2015: 936 18:00 : 11641 MANHATTAN :187571 Mean :10808
01/07/2017: 887 14:00 : 11094 QUEENS :189619 3rd Qu.:11236
09/30/2016: 872 13:00 : 10365 STATEN ISLAND: 33937 Max. :11697
(Other) :985024 (Other):919563 NA's :260826
LATITUDE LONGITUDE LOCATION
Min. : 0.00 Min. :-201.36 :201443
1st Qu.:40.67 1st Qu.: -73.98 (40.6960346, -73.9845292): 673
Median :40.72 Median : -73.93 (40.7606005, -73.9643142): 544
Mean :40.72 Mean : -73.92 (40.7572323, -73.9897922): 485
3rd Qu.:40.77 3rd Qu.: -73.87 (40.6757357, -73.8968533): 480
Max. :40.91 Max. : 0.00 (40.6585778, -73.8906229): 464
NA's :201443 NA's :201443 (Other) :786711
ON.STREET.NAME CROSS.STREET.NAME
:188246 :217648
BROADWAY : 10832 3 AVENUE: 11407
ATLANTIC AVENUE : 9354 BROADWAY: 11088
NORTHERN BOULEVARD: 7490 2 AVENUE: 9678
3 AVENUE : 6864 5 AVENUE: 7846
FLATBUSH AVENUE : 6500 7 AVENUE: 7312
(Other) :761514 (Other) :725821
OFF.STREET.NAME NUMBER.OF.PERSONS.INJURED
:916764 Min. : 0.0000
PARKING LOT 110-00 ROCKAWAY BOULEVARD : 150 1st Qu.: 0.0000
PARKING LOT-772 EDGEWATER RD : 91 Median : 0.0000
PARKING LOT OF 110-00 ROCKAWAY BOULEVARD: 90 Mean : 0.2552
3 AVENUE : 72 3rd Qu.: 0.0000
2 AVENUE : 67 Max. :43.0000
(Other) : 73566
NUMBER.OF.PERSONS.KILLED NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
Min. :0.000000 Min. : 0.00000 Min. :0.0000000
1st Qu.:0.000000 1st Qu.: 0.00000 1st Qu.:0.0000000
Median :0.000000 Median : 0.00000 Median :0.0000000
Mean :0.001214 Mean : 0.05455 Mean :0.0006833
3rd Qu.:0.000000 3rd Qu.: 0.00000 3rd Qu.:0.0000000
Max. :5.000000 Max. :15.00000 Max. :2.0000000
NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED NUMBER.OF.MOTORIST.INJURED
Min. :0.00000 Min. :0.00e+00 Min. : 0.0000
1st Qu.:0.00000 1st Qu.:0.00e+00 1st Qu.: 0.0000
Median :0.00000 Median :0.00e+00 Median : 0.0000
Mean :0.02093 Mean :7.47e-05 Mean : 0.1927
3rd Qu.:0.00000 3rd Qu.:0.00e+00 3rd Qu.: 0.0000
Max. :6.00000 Max. :1.00e+00 Max. :43.0000
NUMBER.OF.MOTORIST.KILLED CONTRIBUTING.FACTOR.VEHICLE.1
Min. :0.000000 Unspecified :523736
1st Qu.:0.000000 Driver Inattention/Distraction:127688
Median :0.000000 Fatigued/Drowsy : 48249
Mean :0.000463 Failure to Yield Right-of-Way : 42948
3rd Qu.:0.000000 Other Vehicular : 30393
Max. :5.000000 Backing Unsafely : 27886
(Other) :189900
CONTRIBUTING.FACTOR.VEHICLE.2 CONTRIBUTING.FACTOR.VEHICLE.3
Unspecified :738985 :925696
:123724 Unspecified : 59537
Driver Inattention/Distraction: 37843 Other Vehicular : 1225
Other Vehicular : 17711 Fatigued/Drowsy : 1122
Fatigued/Drowsy : 13016 Driver Inattention/Distraction: 1100
Failure to Yield Right-of-Way : 9087 Pavement Slippery : 234
(Other) : 50434 (Other) : 1886
CONTRIBUTING.FACTOR.VEHICLE.4 CONTRIBUTING.FACTOR.VEHICLE.5
:976717 :987360
Unspecified : 12938 Unspecified : 3186
Fatigued/Drowsy : 222 Other Vehicular : 52
Other Vehicular : 221 Fatigued/Drowsy : 48
Driver Inattention/Distraction: 192 Driver Inattention/Distraction: 36
Pavement Slippery : 67 Pavement Slippery : 23
(Other) : 443 (Other) : 95
UNIQUE.KEY VEHICLE.TYPE.CODE.1
Min. : 22 PASSENGER VEHICLE :579372
1st Qu.: 249509 SPORT UTILITY / STATION WAGON:218537
Median :3131520 TAXI : 37190
Mean :2054070 VAN : 26511
3rd Qu.:3379220 OTHER : 24699
Max. :3627969 UNKNOWN : 20713
(Other) : 83778
VEHICLE.TYPE.CODE.2 VEHICLE.TYPE.CODE.3
PASSENGER VEHICLE :438701 :926878
SPORT UTILITY / STATION WAGON:165455 PASSENGER VEHICLE : 38181
:134997 SPORT UTILITY / STATION WAGON: 15761
UNKNOWN : 80864 UNKNOWN : 3240
TAXI : 31205 VAN : 1401
OTHER : 25249 TAXI : 1163
(Other) :114329 (Other) : 4176
VEHICLE.TYPE.CODE.4 VEHICLE.TYPE.CODE.5
:977085 :987436
PASSENGER VEHICLE : 8441 PASSENGER VEHICLE : 2072
SPORT UTILITY / STATION WAGON: 3553 SPORT UTILITY / STATION WAGON: 958
UNKNOWN : 583 UNKNOWN : 94
VAN : 248 OTHER : 52
OTHER : 205 VAN : 50
(Other) : 685 (Other) : 138
Our Dataset structure revealed the variables and their classes sapply(names(dt), function(x) paste0(x, ' is class: ', class(dt[[x]])))
The first thing to come to mind with such a factor heavy dataset is counting. Factors are not a hodgepodge collection of values, observed as they are pick up off the ground. Each Level of a factor - ideally - should have been intentionally designed. Ordered and distributed according to a purpose greater than the unit. Though not statically related as quantifiable assets, the levels in a factor are each related to one and other in addition to the group as a whole.
There are a lot of empty cells. To make sure we use a universal value for blank or Not Available we will assign the value NA to all blank cells. While munging around with the data, add what could be a valuable variable created from two current variables. The DATE and TIME variables are set up as factors. This has interesting categorical value so they will stay in the data. Rather than replace the two rows, add a third one in a POSIX date form.
getwd()
[1] "/Users/irJERAD/Documents/Data-Apps/Interview-exercise/nyc-mv-collisions"
With Latitude and Longitude present and appearing to be fairly well documented, let’s take a quick look at how these accidents look over an interactive world map (incase of mistakes outlying somewhere aside from New York). We will use the BOROUGH variable as a factor. This gives the geographic association of each borough and allows us early forsight into anything specific about our point of interest BOURGH == "BROOKLYN"
mp <- dt %>%
plot_mapbox(lat = ~LATITUDE, lon = ~LONGITUDE,
split = ~BOROUGH, mode = 'scattermapbox') %>%
layout(mapbox = list(zoom = 9,
center = list(lat = ~(40.7), lon = ~(-74.0))))
plotly_build(mp)